07 - Diamond Challenge
By Dickson Kwong
Classwork


Enable library & Function

#install.packages("splitstackshape")
#install.packages("dplyr")
#install.packages("stringr")
#install.packages("ggplot2")
#install.packages("DAAG")
#install.packages("caret")

library("splitstackshape")
## Loading required package: data.table
library("plyr")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:data.table':
## 
##     between, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("stringr")
library("ggplot2")
library("caret")
## Loading required package: lattice
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  require(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Goal

Diamond Carat

Background

Today challenge demonstrate various uses for scatterplots and outline some strategies to help make sure key patterns are not obscured by the scale or qualitative group-level differences in the data (e.g., the relationship between diamond price and carat differs for clarity). The motivation in this challenge is to come up with a model of diamond prices that you can use to help make sure you can get a better deal when shopping for diamond as well as refresh your linear regression technique, specified based on insight from exploratory data analysis (EDA) combined with (somewhat) informed speculation. The diamond dataset was scrapped from Diamindse.Info by using some Python script in March 2015.

Let’s try to understand what are the features:

Carat

Diamond Carat Weight: a carat is a metric unit used to measure the weight of gemstones and pearls. One carat is equal to 200mg. Diamond weight below one carat can be expressed in points, each of which weighs 1/100th of a carat, or 2mg. For example, a half carat diamond could be expressed as “50 points”. “25 points” refers to 1/4 carat.

Diamond Carat

Color

While white or colourless stones are traditionally used in diamond engagement rings, diamonds are found in a wide spectrum of colours. Engagement ring diamonds are graded from D to Z, with D being the prized colourless diamond. The hue of D, E and F coloured diamonds is difficult to differentiate to the untrained eye, but the amount of colour becomes more apparent as the alphabet progresses. Z graded diamonds are a pale yellow or brown colour, and anything falling outside of this range is considered a fancy coloured diamond. The colour of the metal in a mounting can either mask or enhance the diamond colour. Yellow gold makes slightly yellow or brown diamonds appear more colourless. If a diamond is mounted in white gold or platinum, the colour becomes more apparent

Diamond Color

Clarity

ost diamonds contain internal “birthmarks”, called inclusions, which make each diamond individual. There are 11 grades of diamonds, ranging from the rarest Flawless (FL) which contain no inclusions when viewed under 10x magnification, to I3, where the diamond contains inclusions easily apparent to the naked eye. Master cutters take into account these inclusions when shaping the diamond, and they don’t endanger the beauty or durability of the diamond. Diamonds graded SI2 and above are generally considered “eye clean” with these marks only visible under magnification.

Diamond Clarity

Cut

Rather than describing the shape of a stone, cut refers to the proportion, symmetry brilliance and polish of a diamond. This means that the diamond cutter has taken into account the inherent characteristics of each individual stone and chose the best way to shape it to accentuate its beauty.

Diamond Cut

It is very essential to know Diamond Anatomy before understanding Cut.

Diamond Anatomy

Dataset

  • There is only one file called diamond_train_2015.csv
  • Total number of records is 751,598
  • There should be only 741,349 records left after filter

Set the working directory:

setwd("/Users/DicksonK/Desktop/r_class/diamond/")

Here is where the data files at, and I import the file for you.

## I personally prefer to import with stringsASFactors as False
diamond_df = read.csv("/Users/DicksonK/Desktop/r_class/diamond/diamond_train_2015.csv", stringsAsFactors = F)

Let’s have a look at the summary:

summary(diamond_df)
##      carat           cut               color             clarity         
##  Min.   :0.200   Length:751598      Length:751598      Length:751598     
##  1st Qu.:0.350   Class :character   Class :character   Class :character  
##  Median :0.520   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.779                                                           
##  3rd Qu.:1.010                                                           
##  Max.   :9.500                                                           
##      table           depth          cert           measurements      
##  Min.   : 0.00   Min.   : 0.0   Length:751598      Length:751598     
##  1st Qu.:56.00   1st Qu.:60.9   Class :character   Class :character  
##  Median :58.00   Median :62.1   Mode  :character   Mode  :character  
##  Mean   :56.98   Mean   :60.5                                        
##  3rd Qu.:59.00   3rd Qu.:62.7                                        
##  Max.   :85.00   Max.   :81.6                                        
##      price      
##  Min.   :  136  
##  1st Qu.:  810  
##  Median : 1670  
##  Mean   : 4867  
##  3rd Qu.: 5140  
##  Max.   :99978

Let’s have a look at the first few rows:

head(diamond_df)
##   carat    cut color clarity table depth cert       measurements price
## 1  1.01 V.Good     J     VS1    55  64.0  GIA 6.35 x 6.32 x 4.05  4940
## 2  1.24  Ideal     F     VS1    59  61.6  GIA 6.87 x 6.91 x 4.25  9582
## 3  0.30 V.Good     F     VS2    59  63.9  GIA 4.21 x 4.16 x 2.67   750
## 4  0.60 V.Good     E     VS2    57  62.9  GIA 5.35 x 5.33 x 3.36  2359
## 5  0.31  Ideal     D     VS2    58  61.2  GIA 4.37 x 4.34 x 2.67   940
## 6  0.60   Good     I     SI1    62  60.2  GIA 5.39 x 5.45 x 3.27  1570

2015 Diamond price prediction

1. Load & pre-process the dataset

  1. Load the dataset
  2. Is there any missing value or some make no sense?
    1. Hints
      1. Should some of the features have negative or zero value?
  3. Remove those data that make no sense?
  4. Extract the measurements in to 3 columns

1.1. Load the dataset

## Set the working directory:
setwd("/Users/DicksonK/Desktop/r_class/diamond/")
## setwd("W://Dept//COE-Ent//04 Projects & Systems//Data_Science//WWTBDS//Pre_section_2//data//")

## Load the dataset
diamond_df = read.csv("diamond_train_2015_clean.csv", stringsAsFactors = F)

## Check number of rows
nrow(diamond_df)
## [1] 741349

1.2. Is there any missing value or some make no sense?

## Just take a look
head(diamond_df)
##   carat    cut color clarity table depth cert measurements_1
## 1  1.01 V.Good     J     VS1    55  64.0  GIA           6.35
## 2  1.24  Ideal     F     VS1    59  61.6  GIA           6.87
## 3  0.30 V.Good     F     VS2    59  63.9  GIA           4.21
## 4  0.60 V.Good     E     VS2    57  62.9  GIA           5.35
## 5  0.31  Ideal     D     VS2    58  61.2  GIA           4.37
## 6  0.60   Good     I     SI1    62  60.2  GIA           5.39
##   measurements_2 measurements_3 price
## 1           6.32           4.05  4940
## 2           6.91           4.25  9582
## 3           4.16           2.67   750
## 4           5.33           3.36  2359
## 5           4.34           2.67   940
## 6           5.45           3.27  1570
## Check the summary
summary(diamond_df)
##      carat            cut               color             clarity         
##  Min.   :0.2000   Length:741349      Length:741349      Length:741349     
##  1st Qu.:0.3500   Class :character   Class :character   Class :character  
##  Median :0.5200   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.7776                                                           
##  3rd Qu.:1.0100                                                           
##  Max.   :9.5000                                                           
##      table           depth           cert           measurements_1    
##  Min.   : 0.50   Min.   : 0.50   Length:741349      Min.   :   0.220  
##  1st Qu.:56.00   1st Qu.:61.00   Class :character   1st Qu.:   4.530  
##  Median :58.00   Median :62.10   Mode  :character   Median :   5.170  
##  Mean   :57.36   Mean   :61.28                      Mean   :   5.608  
##  3rd Qu.:59.00   3rd Qu.:62.70                      3rd Qu.:   6.410  
##  Max.   :85.00   Max.   :81.60                      Max.   :7032.000  
##  measurements_2    measurements_3        price      
##  Min.   :  1.000   Min.   :  0.550   Min.   :  136  
##  1st Qu.:  4.540   1st Qu.:  2.800   1st Qu.:  810  
##  Median :  5.180   Median :  3.200   Median : 1660  
##  Mean   :  5.601   Mean   :  3.473   Mean   : 4851  
##  3rd Qu.:  6.420   3rd Qu.:  3.990   3rd Qu.: 5122  
##  Max.   :738.000   Max.   :483.000   Max.   :99978

2. Exploring the dataset

  1. Is it possible to get a diamond that over 0.75 carat and is cheaper or equal to $1,000? If so, how many of those in the dataset?
  2. What’s the price range of a diamond that is ideal cut, colorless, and clarity at least VVS2?
    1. Hints
      1. Use range
  3. What’s the price standard deviation of the 1 carat colorless diamond that is cert by “GIA?
    1. Hints
      1. Use sd
  4. What’s the median price for a 0.51 carat diamond?
    1. Hints
      1. Use median
  5. What are the top 10 most common diamond carat?
  6. Which feature(s) has the higher than 0.4 correlation to price? What is/are the correlation?
    1. Hints
      1. Use cor
  7. By using GGPLOT2, can we tell which of the following feature(s) have higher correlation that then others? (cut, color, clarity, cert)
    1. Hints
      1. Use geom_point, and color the features
      2. x = carat, y = price
  8. By using GGPLOT2, plot the distribution of the diamond price.
    1. Hints
      1. Use either geom_histogram or geom_density

2.1. Is it possible to get a diamond that over 0.75 carat and is cheaper or equal to $1,000? If so, how many of those in the dataset?

nrow(diamond_df[diamond_df$carat > 0.75 & diamond_df$price <= 1000, ])
## [1] 56

2.2. What’s the price range of a diamond that is ideal cut, colorless, and clarity at least VVS2?

range(diamond_df[diamond_df$cut == "Ideal" & diamond_df$color %in% c("D", "E", "F") & diamond_df$clarity %in% c("FL", "LF", "VVS1", "VVS2"), "price"])
## [1]   309 99880

2.3. What’s the price standard deviation of the 1 carat colorless diamond that is cert by “GIA?

sd(diamond_df[diamond_df$carat == 1.0 & diamond_df$color %in% c("D", "E", "F") & diamond_df$cert == "GIA", "price"])
## [1] 2165.788

2.4. What’s the median price for a 0.51 carat diamond?

median(diamond_df[diamond_df$carat == 0.51, "price"])
## [1] 1624

2.5. What are the top 10 most common diamond carat?

freq_carat = as.data.frame(table(diamond_df$carat))
freq_carat = freq_carat[order(-freq_carat$Freq), ]
top10_carat = freq_carat[1:10, ]
top10_carat
##    Var1  Freq
## 11  0.3 90692
## 21  0.4 54761
## 31  0.5 43896
## 82 1.01 30435
## 12 0.31 27808
## 81    1 23386
## 51  0.7 19996
## 13 0.32 17800
## 32 0.51 14831
## 71  0.9 14815

2.6. Which feature(s) has the higher than 0.4 correlation to price? What is/are the correlation?

## Save the correlation as a dataframe
price_cor = as.data.frame.matrix(cor(diamond_df[sapply(diamond_df, is.numeric)]))["price"]
price_cor["feature"] = row.names(price_cor)
rownames(price_cor) = NULL
price_cor = price_cor[c(2,1)]
price_cor
##          feature      price
## 1          carat 0.86607919
## 2          table 0.04421053
## 3          depth 0.03797440
## 4 measurements_1 0.11267028
## 5 measurements_2 0.32053443
## 6 measurements_3 0.29598735
## 7          price 1.00000000
## Remove price itself and just sort it so it looks nicer
price_cor = price_cor[price_cor$feature != "price", ]
price_cor = price_cor[order(-price_cor$price), ]
price_cor
##          feature      price
## 1          carat 0.86607919
## 5 measurements_2 0.32053443
## 6 measurements_3 0.29598735
## 4 measurements_1 0.11267028
## 2          table 0.04421053
## 3          depth 0.03797440
## Filter those over 0.4
price_cor = price_cor[price_cor$price >  0.4, ]
price_cor
##   feature     price
## 1   carat 0.8660792

2.7. By using GGPLOT2, can we tell which of the following feature(s) have higher correlation to price than the others? (cut, color, clarity, cert)

## Base plot
ggplot(diamond_df, aes(x=carat, y=price)) +
  geom_point(shape=1, alpha = 0.3) + 
  ggtitle('Price by Carat')

col_list = c("cut", "color", "clarity", "cert")

plot_list = NULL

for (col_name in col_list) {
  plot_list[[which(col_list==col_name)]] = ggplot(diamond_df, aes_string(x="carat", y="price", color=col_name)) +
  geom_point(alpha = 0.1) + 
  guides(colour = guide_legend(override.aes = list(alpha = 1))) + 
  ggtitle(paste("Price by Carat", col_name, sep=" - "))
}

multiplot(plot_list[[1]], plot_list[[2]], plot_list[[3]], plot_list[[4]], cols=2)
## Loading required package: grid

## Color & clarity

rm(plot_list)

2.8. By using GGPLOT2, plot the distribution of the diamond price.

## geom_histogram
ggplot(diamond_df, aes(x=price)) + geom_histogram(colour = "blue", fill = "yellow", binwidth=500) + 
  ggtitle('Diamond Price Distributions')

## geom_density
ggplot(diamond_df, aes(x=price)) + geom_density() + ggtitle('Diamond Price Distributions')

3. Linear Regression - Singal feature

  1. Find the R2 for each features when predicting price.
    1. Hints
      1. as.formula might help if you are going to use a loop
      2. You can try and extract measurements value
  2. Which feature produce the highest R2?

3.1. Find the R2 for each features when predicting price.

## The following loop will use each feature in the dataset and try to use linear regression to predict the diamond price. R-square will be saved to a list
rs_list = NULL
for (col_name in names(diamond_df[1:(ncol(diamond_df)-1)])){
  rs_list[which(names(diamond_df[1:(ncol(diamond_df)-1)]) == col_name)] = summary(lm(as.formula(paste("price ~ ", col_name, sep = "")), data = diamond_df))$r.squared
}

## Put the features and R-square into a dataframe
rs_df = data.frame(feature = names(diamond_df[1:(ncol(diamond_df)-1)]), r_square = rs_list)
rs_df
##           feature    r_square
## 1           carat 0.750093167
## 2             cut 0.004269707
## 3           color 0.009666985
## 4         clarity 0.015017332
## 5           table 0.001954571
## 6           depth 0.001442055
## 7            cert 0.013598743
## 8  measurements_1 0.012694592
## 9  measurements_2 0.102742322
## 10 measurements_3 0.087608512
## Just sort it so it looks nicer
rs_df = rs_df[order(-rs_df$r_square), ]
rs_df
##           feature    r_square
## 1           carat 0.750093167
## 9  measurements_2 0.102742322
## 10 measurements_3 0.087608512
## 4         clarity 0.015017332
## 7            cert 0.013598743
## 8  measurements_1 0.012694592
## 3           color 0.009666985
## 2             cut 0.004269707
## 5           table 0.001954571
## 6           depth 0.001442055

3.2. Which feature produce the highest R2?

## Feature with highest R
rs_df[which(rs_df$r_square == max(rs_df$r_square)), ]
##   feature  r_square
## 1   carat 0.7500932
## Create a linear regression model by using carat as a feature
lr_carat = lm(price ~ carat, data = diamond_df)

## Diagnostics plot
ggplot(data.frame(x=predict(lr_carat),y=residuals(lr_carat)), aes(x=predict(lr_carat),y=residuals(lr_carat))) + 
  geom_point() + xlab("Fitted values") + ylab("Residuals") + geom_hline(yintercept=0) + geom_smooth() + 
  ggtitle("Residuals vs Fitted")

ggplot(lr_carat$model, aes_string(x = names(lr_carat$model)[2], y = names(lr_carat$model)[1])) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  labs(title = paste("Adj R2 = ",signif(summary(lr_carat)$adj.r.squared, 5),
                     "Intercept =",signif(lr_carat$coef[[1]],5 ),
                     " Slope =",signif(lr_carat$coef[[2]], 5),
                     " P =",signif(summary(lr_carat)$coef[2,4], 5)))

4. Linear Regression - Multiple features

  1. Write a function to group all the modification / transformation / calculation, so that you can call it with your testing dataset.
  2. Try to find the best R2
  3. Test your model by using cross validation
  4. Predict the price for the diamond in diamond_test_2015.csv
    1. Export that file as diamond_predict_%yourname%.csv with the following format:
      1. id price_est
        1 804
        2 204
        3 2948
    2. Hints
      1. Use predict

Note

Data Transformations

One way to analyze data which does not follow a normal distribution is to try to transform the data points into a distribution which is approximately normal. In other words, take the original data X and apply some mathematical operation to each data point to make a set of data X’

There are many different possible transformations – in fact any mathematical operation which results in a monotonic (i.e. continuously increasing or decreasing) and one-to-one relationship with the original data can result in a valid data transformation. The following will focus on a few which are often helpful.

The logarithmic transformation

The most often used data transformation is the logarithmic transformation. Here each data point is converted into its natural log.

X’ = ln[X]

This transformation is useful when the data distribution is skewed right:

It is also useful when the means and variances of different samples seem to be correlated (i.e., that the samples with higher means tend to have higher variances.)

A log scale is additive for cases in which the untransformed scale is multiplicative. therefore many morphological characters, subject to multiplicative growth curves, tend to show log-normal distributions. The log transform is therefore very important in biology.

Log Transformation

Square-root transformation

Data which involve counts are often more normally distributed when a square-root transformation is applied.

Diamond Carat

Arcsine transformation

When the data points are each themselves proportions, the distribution of these is often skewed. An arcsine transformation often makes these distributions more normal.


End